Algorithms for Association Rules
Abstract
Association rules are "if-then rules" with two measures which quantify the support and confidence of the rule for a given data set. Having their origin in market basket analysis, association rules are now one of the most popular tools in data mining. This popularity is to a large part due to the availability of efficient algorithms following from the development of the Apriori algorithm. We will review the basic Apriori algorithm and discuss variants for distributed data, inclusion of constraints and data taxonomies. The review ends with an outlook on tools which have the potential to deal with long itemsets and considerably reduce the number of (uninteresting) itemsets returned. The discussion will focus on the problem of finding frequent itemsets.

1 Searching for Associations

Association rule mining originated in market basket analysis, which aims at understanding the behaviour and (shopping) interests of retail customers. This understanding helps with product placement and direct marketing. A major challenge is that the retail customer often has 10,000 or more items to choose from, so that one can get widely different market baskets. If one does not consider the variations in amounts of the same items but only the different items in a market basket, one has 2^10000 ≈ 10^3010 different potential market baskets. In practice, however, no market basket will contain very many items. Yet even if one only considers market baskets with, say, up to 30 items, one gets more than 10^87 possibilities. This is an instance of the curse of dimensionality. Any data mining algorithm will somehow have to deal with this curse.

Association rule mining for market basket analysis discovers patterns which occur frequently in observed market baskets. Besides retail, the market basket analysis framework has also been used in the health and other service industries.
Furthermore, association rule mining is now applied far beyond market basket analysis, for example to detect network intrusions or attacks from web server logs and to analyse the usage of web pages. Association rule discovery is also used by scientists to mine DNA sequences and protein structures and to investigate time series.

Two types of patterns can be found in association rule mining. The first type are "if-then rules" of the form: "If a customer buys milk then she also buys bread". The second type relates to the co-occurrence of items in the market basket: "A customer buys bread and milk together". The discovery of the second type of pattern is simpler than that of the first; moreover, the discovery of the first can be based on the discovery of the second. We will thus focus here on the second type of pattern.

One models the potential items as a set I = {a_1, ..., a_m} and a transaction as a subset of I. All patterns are derived from the transaction database, which is a sequence of itemsets DB = (T_1, ..., T_n). A pattern of the second type can be described as an itemset A ⊂ I as well. A transaction T_i is said to support an itemset A if A ⊂ T_i. The number (sometimes the proportion) of all transactions which support A is called the support of A in DB and is denoted by σ(A) = #{i | A ⊂ T_i}.

A frequent itemset A is defined as an itemset whose support is at least some threshold s_0. The choice of this threshold by the user determines how many frequent itemsets are found but also how useful these itemsets will be. Frequent itemset mining aims to find all itemsets A for which σ(A) ≥ s_0. The naive approach would be to determine the support of all possible itemsets and select the ones which are frequent. This is infeasible, as the set of all possible itemsets is the powerset of the set of all items, which has 2^m elements and is very large in most applications.
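The definitions of support and of the naive search can be illustrated with a minimal Python sketch. The toy database `DB`, the item names, and the function names `support` and `naive_frequent_itemsets` are assumptions made for this illustration and do not come from the paper. The support is taken here as a count, not a proportion.

```python
from itertools import combinations

# Hypothetical toy transaction database DB = (T_1, ..., T_n).
DB = [
    {"milk", "bread"},
    {"milk", "bread", "coffee"},
    {"bread", "juice"},
    {"milk", "coffee"},
]

def support(itemset, db):
    """sigma(A): number of transactions T_i that contain the itemset A."""
    a = frozenset(itemset)
    return sum(1 for t in db if a <= t)

def naive_frequent_itemsets(db, s0):
    """Visit every non-empty itemset and keep those with support >= s0.

    This enumerates the whole powerset (2^m - 1 itemsets for m items),
    so it is only usable for a handful of items.
    """
    items = sorted(set().union(*db))
    frequent = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = support(combo, db)
            if s >= s0:
                frequent[frozenset(combo)] = s
    return frequent

result = naive_frequent_itemsets(DB, s0=2)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With four items the loop visits only 15 itemsets, but the count doubles with every additional item, which is why this approach does not scale to realistic item catalogues.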
Association rule mining algorithms will have to find ways to detect the frequent itemsets without visiting all the possible itemsets. Note that the itemsets form a Boolean lattice; see Figure 1 for a simple example. The search for all frequent itemsets typically starts with the empty set, the minimal element of this lattice, and covers as little of the lattice as possible.

[Figure 1: lattice diagram, itemsets listed by level]
{}
{bread} {coffee} {juice} {milk}
{bread, coffee} {milk, coffee} {milk, bread} {milk, juice} {bread, juice} {coffee, juice}
{milk, coffee, juice} {bread, coffee, juice} {milk, bread, coffee} {milk, bread, juice}
{milk, bread, coffee, juice}
Fig. 1. Boolean lattice of breakfast itemsets

In the next section we review the basic Apriori algorithm; in Section 3 we consider some variants, and in the last section we look at the problem of finding very large frequent itemsets.

2 Breadth First Search: Apriori Algorithm

The Apriori algorithm [1] determines the support of itemsets in a levelwise, breadth-first fashion. First it finds the supports of the 1-itemsets (the itemsets with only one element), then of the 2-itemsets, and so on:

    C_1 is the set of all 1-itemsets, k = 1
    while C_k ≠ ∅
        scan database to determine support σ(A) for all A ∈ C_k
        extract frequent itemsets from C_k into L_k
        generate C_{k+1}
        k := k + 1

The algorithm does not determine the supports of all possible itemsets. Instead, it uses a clever strategy to determine candidates for frequent itemsets, i.e., it finds sets C_k of k-itemsets which contain all the frequent k-itemsets but not much else. The main observation on which this selection of candidates is based is the apriori principle, which we formulate as a simple lemma:

Lemma 1. For any itemset A ⊂ I with σ(A) ≥ s_0 and any other itemset B ⊂ I with B ⊂ A one has σ(B) ≥ s_0.

Proof. Any transaction which contains A also contains every B ⊂ A, so #{i | A ⊂ T_i} ≤ #{i | B ⊂ T_i} and thus σ(A) ≤ σ(B). This is the antimonotonicity of the support function σ(·). As σ(A) ≥ s_0 it follows that σ(B) ≥ s_0.
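The levelwise loop and the use of Lemma 1 for candidate selection can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation from [1]: the function name `apriori`, the toy database `DB`, and the simple pairwise join used to build C_{k+1} are choices made for this sketch, and the inner loops are deliberately unoptimised.

```python
from itertools import combinations

def apriori(db, s0):
    """Levelwise search returning {frequent itemset: support} for support >= s0."""
    transactions = [frozenset(t) for t in db]
    # C_1: all 1-itemsets that occur in the database.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # One database scan: count the support of every candidate in C_k.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        # L_k: the frequent itemsets among the candidates.
        level = {c for c, n in counts.items() if n >= s0}
        frequent.update({c: counts[c] for c in level})
        # C_{k+1}: (k+1)-itemsets all of whose k-subsets are frequent.
        # By Lemma 1, no other (k+1)-itemset can be frequent.
        candidates = set()
        for a in level:
            for b in level:
                union = a | b
                if len(union) == k + 1 and all(
                    frozenset(sub) in level for sub in combinations(union, k)
                ):
                    candidates.add(union)
        k += 1
    return frequent

# Hypothetical toy database, used only to exercise the sketch.
DB = [
    {"milk", "bread"},
    {"milk", "bread", "coffee"},
    {"bread", "juice"},
    {"milk", "coffee"},
]
frequent = apriori(DB, s0=2)
```

On this toy database the sketch counts supports for four 1-itemsets and three 2-itemsets and then stops, because the only possible 3-itemset candidate has an infrequent 2-subset and is pruned without a database scan.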
So any subset of a frequent itemset has to be frequent, and once all the frequent itemsets up to size k are known, all the proper subsets of the frequent (k+1)-itemsets are known as well. Thus one chooses as the set of candidates C_{k+1} just the (k+1)-itemsets which only have frequent proper subsets. This, in fact, is the minimal set of candidates one can find without a further data scan, given the frequent itemsets up to level k. The simplest way to determine the candidate set C_{k+1} would be to enumerate all possible (k+1)-itemsets and remove the ones which have infrequent subsets. This, however, is again not feasible for larger m and k as the number of possible (k+1)-itemsets is ( m
Publication date: 2002